Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

260 ◾ Bioinformatics

P S G

P v G

∏

(

)

(

)

(7.7)

Finally, the probability that an unknown query sequence S is a member of genus G is

given as

P G S

P S G

P G

P S

(

)

( )

( |

)

(7.8)

where P(G│S) is the posterior probability, P(G) is the class prior probability of a sequence

being a member of G, and P(S) is the predictor prior probability.

Since both class prior probability and predictor prior probability are constant for all

sequences, we can drop both P(G) and P(S). So, the above formula can be rewritten as

P G S

P S G

P v G

∏

(

)

(

)

(

)

(7.9)

Based on the naïve Bayesian, each representative sequence obtained from the denoising

step is assigned to the genus G or a taxonomic group that is giving the highest probability

score after calculating the score for each taxonomic group.

7.2.4 Construction of Phylogenetic Trees

A phylogenetic tree is a diagram that represents evolutionary relationships among organ-

isms. The pattern of branching in a phylogenetic tree describes how species or taxonomic

groups evolved from a series of common ancestors and how they are related. Two spe-

cies are more related if they have a more recent common ancestor and less related if they

have a less recent common ancestor. In amplicon-based metagenomic studies, phyloge-

netic trees based on targeted genes (e.g., 16S rRNA gene) are commonly used to describe

and compare the taxonomic groups of the bacterial community in the sample. There are

several tree-building algorithms implemented in software. The construction of the phylo-

genetic tree begins with computing genetic distances between the unique representative

sequences identified in the above steps. The genetic distance between two representative

sequences is the number of mutations or evolutionary events between two sequences since

their divergence. The traditional Hamming method calculates genetic distance between

any two sequences by counting the number of differences between them. However, this

method does not capture the evolutionary history of the sequences since not all muta-

tions are shown in the current sequences. The Hamming distance is corrected with Jukes–

Cantor logarithmic correction. Let p be the distance between two sequences, the distance

(d) after Jukes–Cantor correction is given as

−



^



^

4 ^{log 1}

(7.10)

The tree is then built from the pairwise distances of the representative sequences. Several

tree-building methods are available. The most popular distance-based methods are the